Atomic add with carry instruction

ABSTRACT

Processing circuitry performs processing operations specified by program instructions. An instruction decoder decodes an atomic-add-with-carry instruction (AADDC) to control the processing circuitry to perform an atomic operation of an add of an addend operand value and a data value stored in a memory to generate a result value stored in the memory and a carry value indicative of whether or not the add generated a carry out.

Some data processing systems may support atomic instructions which access data values in memory and are executed such that the results of executing the instruction are consistent with the instruction having exclusive access to the data value in memory during execution of the instruction, e.g. no other instruction can access the same data value in an overlapping fashion so as to produce a result inconsistent with the atomic instruction having had exclusive access to that data value during its execution. Atomic instructions are used in an effort to isolate the execution of individual instructions so that there is no inappropriate and/or undesired interaction with the execution of other instructions.

At least some embodiments of the disclosure provide apparatus for processing data comprising:

processing circuitry to perform processing operations specified by program instructions; and

an instruction decoder to decode an atomic-add-with-carry instruction to control said processing circuitry to perform as an atomic operation an add of an addend operand value and a data value stored in a storage unit in a manner consistent with exclusive access to said data value during said atomic operation to generate a result value stored in said storage unit and a carry value indicative of whether said add generated a carry out.

At least some further embodiments of the disclosure provide apparatus for processing data comprising:

processing means for performing processing operations specified by program instructions; and

instruction decoding means for decoding an atomic-add-with-carry instruction to control said processing means to perform as an atomic operation an add of an addend operand value and a data value stored in a storage unit in a manner consistent with exclusive access to said data value during said atomic operation to generate a result value stored in said storage unit and a carry value indicative of whether said add generated a carry out.

At least some further embodiments of the disclosure provide a method of processing data comprising:

performing processing operations specified by program instructions with processing circuitry; and

decoding an atomic-add-with-carry instruction to control said processing circuitry to perform as an atomic operation an add of an addend operand value and a data value stored in a storage unit in a manner consistent with exclusive access to said data value during said atomic operation to generate a result value stored in said storage unit and a carry value indicative of whether said add generated a carry out.

Example embodiments will now be described, by way of example only, with reference to the accompanying drawings in which:

FIG. 1 schematically illustrates a data processing system for executing atomic instructions;

FIG. 2 schematically illustrates a multi-processor data processing system having a shared memory;

FIG. 3 schematically illustrates the action of two threads each executing a sequence of atomic-add-with-carry instructions accumulating respective different bit significance portions of an accumulated value;

FIG. 4 schematically illustrates the relative ordering of execution of the atomic-add-with-carry instructions illustrated in FIG. 3;

FIGS. 5A to 5G schematically illustrate the execution of a sequence of atomic-add-with-carry instructions generating a local sum value which is then accumulated into a data value when the data value is returned to a cache memory;

FIG. 6 is a diagram schematically illustrating the use of atomic-add-with-carry instructions in a system incorporating a plurality of data processing apparatuses operating as a coalescing tree to accumulate to a data value held by a root processor; and

FIG. 7 is a flow diagram schematically illustrating the operation of a node within the coalescing tree of FIG. 6.

In accordance with at least some example embodiments of the disclosure there is provided an atomic-add-with-carry instruction which performs, as an atomic operation, an add of an addend operand and a data value stored in a storage unit. A carry value is generated from this atomic-add-with-carry instruction. The generation of a carry value from an atomic instruction is unusual in that it indicates that the atomic instruction is to interact with other instructions via this carry value. This is counter to the normal philosophy whereby atomic instructions are self-contained.

The storage unit could have a variety of different forms, e.g. a register. The storage unit may be memory mapped (e.g. associated with one or more memory addresses within a memory address space). In some embodiments the storage unit may be a memory such as SRAM, DRAM, or similar.

Although useable and useful in a variety of different circumstances, the atomic-add-with-carry instructions may be used in some example embodiments of the disclosure in which the addend operand value and the data value have a shared range of bit significance and at least one of the addend operand value and the data value is part of a larger value having a total range of bit significance greater than and including the shared range of bit significance. It is thus possible to represent values greater than the data width supported and manipulated natively within a data processing apparatus by breaking up the larger data value into a plurality of data values which are separately manipulated. The bit significance of those individual data values and the larger data value may itself be programmable and represented by metadata associated with the data value concerned. In such an arrangement, the carry value generated by an atomic-add-with-carry instruction allows atomic instruction behaviour to be supported and permits the necessary interaction between the different portions of a data value of greater bit significance width to be achieved via the carry value. In such arrangements, the carry value generated by an atomic-add-with-carry instruction may be added to an addend operand of a further atomic-add-with-carry instruction representing a next most significant portion of a larger data value in a manner which permits an overall atomic behaviour to be achieved for a data manipulation which is in fact split over multiple atomic-add-with-carry instructions.

The carry value may be provided in a variety of different ways, such as an explicit return operand or via a carry out flag. In some embodiments, the atomic-add-with-carry instruction may also have a carry-in operand which is added to the addend operand before this is in turn added to the data value. Thus, the atomic instruction may in some embodiments have both a carry out value and a carry in value.
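The behaviour described so far can be sketched in software. The following is only an illustrative model, not the disclosed hardware: the `Memory` class, the `aaddc` method name, the 64-bit `WIDTH` and the use of a lock to stand in for exclusive access are all assumptions made for the example.

```python
import threading

WIDTH = 64                    # assumed operand width in bits
MASK = (1 << WIDTH) - 1


class Memory:
    """Hypothetical storage unit; the lock only models exclusive access."""

    def __init__(self):
        self._cells = {}
        self._lock = threading.Lock()

    def load(self, addr):
        return self._cells.get(addr, 0)

    def store(self, addr, value):
        self._cells[addr] = value & MASK

    def aaddc(self, addr, addend, carry_in=0):
        """Atomically add addend (plus optional carry in) to the data value.

        The result value is written back to the storage unit and the
        carry out (0 or 1) is returned as an explicit return operand.
        """
        with self._lock:
            total = self._cells.get(addr, 0) + (addend & MASK) + carry_in
            self._cells[addr] = total & MASK
            return total >> WIDTH
```

In this sketch the carry out is an explicit return value; an embodiment using a carry flag would expose the same bit differently.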

One form of example use of the atomic-add-with-carry instruction is within an apparatus that includes a cache memory to store the data value. If the data value is not present within the cache memory, then a sequence of atomic-add-with-carry instructions may accumulate a local sum value of the respective addend operand values within the apparatus with this local sum value then being added to the data value when the data value becomes available in the cache memory. Thus, the completion of execution of at least some of the atomic-add-with-carry instructions need not be delayed awaiting the data value being fetched into the cache memory.

In some example embodiments of the above type of system, the sequence of atomic-add-with-carry instructions may be from respective program threads executing upon the apparatus. Within such systems, a given atomic-add-with-carry instruction that accumulates its addend operand value to the local sum value may be returned a carry value such that the given atomic-add-with-carry instruction may be completed and so permit execution to advance to execute further program instructions within the given program thread which contains the given atomic-add-with-carry instruction.

One way of ensuring that the final outcome of the sequence of atomic-add-with-carry instructions matches the intended external view of execution of that sequence is to delay returning a final carry for a final atomic-add-with-carry instruction within the sequence until the data value is available within the cache memory and the local sum value has been added to that data value in order to generate the final carry value.

Another example use of atomic-add-with-carry instructions is within a system which coalesces a plurality of such instructions to generate a local sum value, returns carry values to all but one of the input instructions, and generates an output atomic-add-with-carry instruction which is then performed by a further processing apparatus. A carry out value received from that further processing apparatus is passed back to the instruction source for the instruction which has not yet received its carry out value. In this way, the workload of performing the adds may be distributed and early return values generated to at least some of the instruction sources, thereby permitting those instruction sources to start to perform other processing operations rather than waiting for a delayed return value depending upon processing performed elsewhere.

In some embodiments of the disclosure the received carry out value is the return value for the output atomic-add-with-carry instruction which was generated by the apparatus which coalesced the given atomic-add-with-carry instruction and the one or more further atomic-add-with-carry instructions.

The given input and further input atomic-add-with-carry instructions may form part of a sequence and within such an arrangement, the final instruction within the sequence may be held and associated with the received carry out value returned from the further processing apparatus.

The techniques of this disclosure may be usefully used when the apparatus is part of a coalescing tree to coalesce atomic-add-with-carry instructions as formed by a plurality of processing apparatus branching from a root processing apparatus with that root processing apparatus storing a data value to which the atomic adds are to be accumulated.

FIG. 1 schematically illustrates the data processing apparatus 2 including a register file 4, arithmetic/logic circuitry 6, a load/store unit 8, an instruction fetch unit 10, an instruction pipeline 12 and an instruction decoder 14. In operation, program instructions are fetched by the instruction fetch unit 10 and passed to the instruction pipeline 12. When the program instructions reach the decode stage within the instruction pipeline 12, then the decoder 14 decodes these program instructions to generate control signals which control the arithmetic/logic circuitry 6, the load/store unit 8 and the register file 4 to perform processing operations as specified by the program instructions. These processing operations may include load operations and store operations performed by the load/store unit 8 upon data values held within a storage unit, such as a memory. The memory may be a local cache memory, or a higher level within a hierarchical memory system.

The load/store unit 8 and the arithmetic/logic circuitry 6 serve to provide an atomic-add-with-carry instruction which serves to add an addend operand value to a data value stored at a specified memory address in an atomic fashion (e.g. in a manner consistent with the execution of the instruction having exclusive access to that data value during execution). It will be appreciated that in the context of the present disclosure, references to add instructions also encompass subtraction instructions as a modified form of add instructions (e.g. adding a two's complement value). Accordingly, references to add instructions should also be considered to include subtraction instructions, and an atomic-add-with-carry instruction corresponds to operations which may be either additions or subtractions.

FIG. 2 schematically illustrates a data processing system 16 which incorporates multiple processors of the form of FIG. 1, namely processors 18, 20, 22, 24, each having a respective local cache memory 26, 28, 30, 32. The local cache memories 26, 28, 30, 32 cache data values from a shared memory 34. Coherency control circuitry 36 serves to perform coherency control operations so as to manage data coherency between the different versions of a data value which may be stored by the shared memory 34 and the respective local cache memories 26, 28, 30, 32. The processors 18, 20, 22, 24 include processing circuitry which performs data processing operations (e.g. in the context of FIG. 1 including the register file 4, the arithmetic/logic circuitry 6, and the load/store unit 8) as well as an instruction decoder 14 which serves to decode program instructions. These program instructions include atomic-add-with-carry instructions which perform, as an atomic operation, an add of an addend operand value and a data value stored in a memory (such as one of the local cache memories 26, 28, 30, 32, or the shared memory 34) to generate a result value which is stored in the memory and a carry value indicative of whether or not the add performed generated a carry out. This carry value may be returned as a return value for the atomic-add-with-carry instruction (or in some embodiments could be returned as a carry flag value). The atomic-add-with-carry instruction may also have a carry-in which is added to the addend operand value with this result then being added to the data value stored in the memory. Thus, the atomic-add-with-carry instruction may have both a carry in and a carry out.

FIGS. 3 and 4 are diagrams schematically illustrating the use of atomic-add-with-carry instructions from two different threads of program execution to accumulate to a value acc. This accumulate value (acc) has a total range of bit significance greater than the range of bit significance which can be accommodated by an individual operand of the atomic-add-with-carry instructions.

Accordingly, the addend operand value and the data value for a given atomic-add-with-carry instruction have an associated shared range of bit significance corresponding to a portion of the larger range of bit significance associated with multiple operands. As an example, a 192-bit accumulate value may be formed from three 64-bit values acc₁, acc₂ and acc₃ (low-to-high bit significance order). Each of these 64-bit values corresponds to a range of bit significance within the larger total range of bit significance corresponding to the 192-bit accumulate value.

As illustrated in FIG. 3, both a first 192-bit addend and a second 192-bit addend may be accumulated into a starting 192-bit accumulate value. Each of the addends is formed of three 64-bit values. These are represented in FIG. 3 as values av_xy, where x corresponds to a thread identifier and y corresponds to a number indicating which range of bit significance is being represented by that addend value.

The shared range of bit significance of the individual portions of the addends and the accumulate value, together with the bit significance of the total range of bit significance, may be represented by metadata associated with each of these entities. This metadata may be set so as to represent the bit significance of the values within a larger overall possible range of bit significance. The metadata effectively indicates a window into this larger overall range of bit significance which is provided by the individual and collective operands illustrated in FIG. 3. The collective larger value comprising the individual operands of FIG. 3 may itself be a small portion of the maximum range of bit significance which can be represented by appropriate use of the metadata values.

Returning to the example of FIG. 3, the addition of the first addend and the second addend into the accumulate value is an associative operation, namely it does not matter whether the first addend value is added to the accumulate value before the second addend value is added to the accumulate value or vice versa. Furthermore, the ordering of when the additions are performed may change between the different shared ranges of bit significance. Thus, for bit significance range A, the relevant portion of the first addend av₁₁ is added to the corresponding shared bit significance range portion of the accumulate value acc₁ in step S1₁ as the first action. A carry out from this addition is generated as signal c₁₁ at step S2₁ and added into the next higher bit significance portion of the addend, namely operand av₁₂. Following the addition of step S1₁, the corresponding lowest bit significance portion of the second addend av₂₁ is added into the accumulate value (which has already been modified by addition of the relevant portion of the first addend) at step S3₁. A carry out value from this second addition is generated as value c₂₁ at step S4₁ and added into the operand av₂₂. Thus, in respect of the least significant portions of the first addend, the second addend and the accumulate value, the addition of the first addend portion is performed before the addition of the second addend portion. The addition of the first addend portion corresponds to (acc₁+av₁₁) and generates a carry out value c₁₁. After this addition has been performed, then the second addition adds in to the accumulating value acc₁ the value av₂₁ and generates a carry out value c₂₁.

In respect of the bit significance B portion of the operations, in this illustrated example, the order in which the relevant portions of the first and second addends are added into the corresponding significance portion of the accumulate value acc₂ is reversed compared to that of bit significance portion A. Thus, av₂₂ is added to the accumulate value acc₂ at step S1₂. A carry out c₂₂ is then generated from this addition at step S2₂ and added into the operand av₂₃. Subsequently, at step S3₂, the operand from the first addend av₁₂ is added into the accumulate value for bit significance portion B. A carry out c₁₂ from this addition is generated at step S4₂ and added into the operand av₁₃. Thus, in respect of the bit significance portion B, the order in which the addends are accumulated into the accumulate value is reversed relative to bit significance portion A.

Finally, in respect of bit significance portion C, the operand av₁₃ from the first addend is added to the corresponding portion of the accumulate value acc₃ at step S1₃ and generates a carry out c₁₃ at step S2₃. Then, the operand av₂₃ is added at step S3₃ to the accumulate value acc₃ and generates a carry out c₂₃ at step S4₃. Thus, the order in which the operands are added into the accumulate value is the same as for bit significance range A and opposite to that of bit significance range B.

Each of the additions illustrated in FIG. 3 into the accumulate value is performed as an atomic-add-with-carry operation specified by an atomic-add-with-carry instruction. The associative nature within each bit significance range portion has the effect that the ordering of the additions within each bit significance range portion may be varied without influencing the final result. The carry out values from each addition are supplied to the next partial addend operand, which forms a portion of the total addend, and are added into that partial addend operand before its own addition is performed. This variation in the ordering which may be used in the different bit significance range portions has the effect that partway through the total calculation, the values represented by the accumulate operands acc₁, acc₂ and acc₃ (the in-memory representation) may not represent any true meaningful value, but at the end when all of the atomic-add-with-carry instructions have been executed, then the final result within the total accumulate value (acc₃: acc₂: acc₁) stored within memory will be correct and all carries will have been appropriately reflected. Since the in-memory representation may not be meaningful at all times, there are some use cases which cannot use the present techniques, yet there are a significant number of other use cases where this issue is not problematic and the present techniques may be usefully employed.
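The order independence described for FIG. 3 can be checked with a small single-threaded model. This is a sketch under stated assumptions: the helper names (`aaddc`, `split`, `join`, `run`), the schedule encoding and the example values are hypothetical, the carry from one limb is supplied to the next AADDC as a carry-in rather than being pre-added to the addend, and the carry out of the most significant limb is simply discarded.

```python
WIDTH = 64
MASK = (1 << WIDTH) - 1


def aaddc(acc, index, addend, carry_in=0):
    """Model of one AADDC on limb `index` of the in-memory accumulator."""
    total = acc[index] + addend + carry_in
    acc[index] = total & MASK
    return total >> WIDTH


def split(value, limbs=3):
    """Break a wide value into 64-bit limbs, least significant first."""
    return [(value >> (WIDTH * i)) & MASK for i in range(limbs)]


def join(limbs):
    """Reassemble limbs (least significant first) into one integer."""
    return sum(limb << (WIDTH * i) for i, limb in enumerate(limbs))


def run(start, addends, schedule):
    """Apply (thread, limb) steps in the given global order."""
    acc = split(start)
    parts = [split(a) for a in addends]
    carry = [0] * len(addends)          # per-thread carry between limbs
    for thread, limb in schedule:
        carry[thread] = aaddc(acc, limb, parts[thread][limb], carry[thread])
    return join(acc)


# Global order of FIG. 4 (threads and limbs numbered from 0 here).
fig4_order = [(0, 0), (1, 0), (1, 1), (0, 1), (0, 2), (1, 2)]
# A different interleaving that still keeps each thread in limb order.
other_order = [(1, 0), (0, 0), (0, 1), (0, 2), (1, 1), (1, 2)]

start = (1 << 128) - 1
first, second = 1, MASK
expected = (start + first + second) % (1 << (3 * WIDTH))
```

Both `run(start, [first, second], fig4_order)` and `run(start, [first, second], other_order)` should equal `expected`, even though the intermediate contents of the limbs differ between the two schedules.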

Other example embodiments may use addends and an in-memory accumulator that have different bit widths, e.g. 64-bit addends into a 192-bit accumulator. In this case often a single 64-bit AADDC will suffice when there is no carry out. Occasionally two AADDC instructions will be needed when there is one carry and rarely three AADDC instructions when there are two carries. These situations are a special case of the arrangement of FIG. 3 where an “early-out” is supported when no carry is generated.
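The early-out case can be illustrated as follows. This is a minimal sketch, assuming a three-limb accumulator held as a Python list and hypothetical helper names `aaddc` and `accumulate_64`; it counts how many AADDC operations are issued for a single 64-bit addend.

```python
WIDTH = 64
MASK = (1 << WIDTH) - 1


def aaddc(acc, index, addend):
    """Model of one AADDC on limb `index`; returns the carry out."""
    total = acc[index] + addend
    acc[index] = total & MASK
    return total >> WIDTH


def accumulate_64(acc, addend):
    """Add a 64-bit addend into a multi-limb accumulator.

    Returns the number of AADDC operations issued. The loop stops as
    soon as an add produces no carry out (the early-out).
    """
    issued = 0
    value = addend & MASK
    for index in range(len(acc)):
        value = aaddc(acc, index, value)   # carry becomes next addend
        issued += 1
        if value == 0:
            break
    return issued


no_carry = [0, 0, 0]
one_carry = [MASK, 0, 0]
two_carries = [MASK, MASK, 0]
counts = [accumulate_64(a, 1) for a in (no_carry, one_carry, two_carries)]
```

Here `counts` is `[1, 2, 3]`, matching the one, two and three instruction cases described above.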

FIG. 4 schematically illustrates the execution of atomic-add-with-carry instructions (AADDC) for each of the two threads corresponding respectively to the first addend values and the second addend values of FIG. 3. The execution of these program instructions is shown relative to a time line. In the example of FIG. 4, the least significant atomic-add-with-carry instruction for the first thread, namely AADDC₁₁, is executed first. This is then followed by the atomic-add-with-carry instructions for the least significant bit portion and the middle significance bit portion, namely AADDC₂₁ and AADDC₂₂, for the second thread. Following this, the atomic-add-with-carry instructions for the middle significance portion and the most significant portion of the first thread, namely AADDC₁₂ and AADDC₁₃, are performed. Finally, the atomic-add-with-carry instruction for the most significant portion of the second thread, namely AADDC₂₃, is performed. Within each thread, the atomic-add-with-carry instructions are performed in their bit significance order with a carry signal propagating therebetween as required. Between the threads, the ordering is associative and may be varied to suit the requirements of the system. The associative behaviour between threads allows greater freedom in instruction scheduling.

FIGS. 5A to 5G schematically illustrate the use of atomic-add-with-carry instructions in the context of a system having a cache memory 38 serving to store cache lines of data from the shared memory 34. An input queue 40 contains a queue of atomic-add-with-carry instructions (AADDC) from respective program threads A, B, C and D. Processing circuitry 42 serves to execute these AADDC instructions when the data value being added to is not present within the cache 38 by accumulating into a local sum value 44 using a local accumulator 46. Instructions awaiting their return operands are parked within a parked operations queue 48. When an instruction has executed and has all its return operands, then it is sent to an output queue 50.

FIG. 5A illustrates the situation in which the addend value 0x01 for Thread A is added to the initial local sum value 44 of 0x00 (“0x” indicates a hexadecimal number). The data value from the shared memory 34 is not present within the cache 38.

FIG. 5B illustrates execution of the second atomic-add-with-carry instruction AADDC-0x10 from Thread B, which has an addend operand value 0x10 and which is accumulated into the local sum 44. The AADDC instruction for Thread A is parked within the parked operations queue 48.

FIG. 5C illustrates the next cycle at which the return carry value for Thread A is returned indicating no carry occurred. At the same time, the addend operand value 0xF0 for Thread C is added to the local sum value 44 of 0x11 to generate an updated local sum value.

FIG. 5D illustrates the Thread D addend operand value 0x04 being added into the local sum value 0x01 to generate an updated local sum value 0x05. The previous add of FIG. 5C generated a carry and this carry is sent as the returned carry output value associated with Thread B as illustrated. Thus, Thread B is associated with a carry out value which in fact resulted from the addition performed in respect of Thread C.

FIG. 5E illustrates the situation after the final atomic-add-with-carry instruction for Thread D has been executed and added to the local sum value 44 producing result 0x05. At the same time, a return carry out value which is to be associated with Thread C, and corresponds to the addition of the addend operand value for Thread D, is passed out and is a zero. This leaves Thread D as a parked operation.

It will be appreciated that each of the atomic-add-with-carry instructions for Threads A, B, C and D has been performed with respect to a local sum value, but not yet with respect to the data value stored in the shared memory 34 as intended. FIG. 5F illustrates how the data value 0xFC is returned from the shared memory 34 to the cache memory 38 and then the local sum value 0x05 is added to this so as to generate the return carry out value for the final parked operation from Thread D. In the example of FIG. 5G, the final result stored into the data value (currently held within the cache 38) results in the data value 0x01 with a final carry out value of 1.

Thread D is the final thread in the sequence of threads and return of its return carry out value is delayed until the final addition with the data value has been performed. The other Threads A, B, C have return carry out values supplied to them in advance of the final addition with the data value being performed and accordingly these threads may be released to perform further processing operations earlier than if they had waited for the data value to be returned to the cache 38. Thus the addend operands are accumulated within a local sum value 44 and return carry out values returned for all but the final instruction. When the data value becomes available, then the local sum value is added to it, and the final return carry out value can be generated and returned for the final instruction of Thread D.
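The sequence of FIGS. 5A to 5G can be reproduced with a short model. This is a sketch only: the function name `run_sequence`, the 8-bit width inferred from the example values, and the dictionary of returned carries are assumptions for illustration. Each add into the local sum returns its carry to the previously parked thread, and the last parked thread receives the carry from the final add with the data value.

```python
WIDTH = 8                     # width implied by the 0x.. values in FIG. 5
MASK = (1 << WIDTH) - 1


def run_sequence(instructions, data_value):
    """Model the local-sum accumulation of FIGS. 5A to 5G.

    `instructions` is a list of (thread, addend) pairs in queue order.
    Returns the final data value and the carry returned to each thread.
    """
    local_sum = 0
    parked = None             # thread still awaiting its return carry
    returns = {}
    for thread, addend in instructions:
        total = local_sum + addend
        local_sum = total & MASK
        if parked is not None:
            # The carry from this add completes the previously parked op.
            returns[parked] = total >> WIDTH
        parked = thread
    # Data value arrives in the cache: add the local sum to it.
    total = data_value + local_sum
    returns[parked] = total >> WIDTH
    return total & MASK, returns


queue = [("A", 0x01), ("B", 0x10), ("C", 0xF0), ("D", 0x04)]
final_value, carries = run_sequence(queue, 0xFC)
```

With the values from the figures, `final_value` is 0x01 and `carries` maps A to 0, B to 1, C to 0 and D to 1. The two returned carries together account for the full sum 0xFC + 0x01 + 0x10 + 0xF0 + 0x04 = 0x201.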

FIG. 6 schematically illustrates use of the atomic-add-with-carry instructions in another context. This context is a coalescing tree with a 2:1 fan-in for performing the atomic-add-with-carry instructions. Each of the coalescing nodes within the coalescing tree of FIG. 6 represents a processing apparatus which receives a plurality of atomic-add-with-carry instructions from a processing apparatus at a higher level in the hierarchy (as illustrated in FIG. 6). At the highest level in the hierarchy are processing apparatuses P₁₀ to P₁₇ which each output an atomic-add-with-carry instruction. The next level is coalescing processors P₂₀ to P₂₃. The next level is coalescing processors P₃₀ and P₃₁. Finally, among the coalescing processors, is processor P₄₀. The processing apparatus which holds the data value 52 to which the accumulates (adds) are to be made is processor P_root. The coalescing processors at the nodes between the highest level in the hierarchy of FIG. 6 and the root of the hierarchy of FIG. 6 each receive two atomic-add-with-carry instructions and generate one output atomic-add-with-carry instruction which is passed down to the next lower level within the hierarchy, i.e. a 2:1 fan-in. A return carry out operand value is returned from each coalescing node to the first instruction source from which it received an atomic-add-with-carry instruction when the node has formed a local sum value from the addend operands of its received atomic-add-with-carry instructions. The second of the atomic-add-with-carry instructions is not supplied with its return carry out value and is held until a return carry out value is received by that node in respect of the output atomic-add-with-carry instruction which it generated. The fan-in illustrated in FIG. 6 is 2:1, but it will be appreciated that higher levels of fan-in could be supported if desired.

Each coalescing node within the coalescing tree of FIG. 6 receives both a given input atomic-add-with-carry instruction (the one for which the return will be delayed) and one or more further input atomic-add-with-carry instructions. The node performs a local addition of the given addend operand value for the given input atomic-add-with-carry instruction and the one or more further input addend operand values for the one or more further input atomic-add-with-carry instructions to generate a local sum value and one or more local carry out values. These one or more local carry out values are sent as respective return values to the one or more further instruction sources of the one or more further instructions. The node generates an output atomic-add-with-carry instruction specifying the local sum value as its addend operand value and passes this to the next lower level within the coalescing tree of FIG. 6. The node then waits until it receives a received carry out value in respect of this output atomic-add-with-carry instruction and when this is received, it is sent as a return value to the given instruction source which is the instruction source which was waiting for its return carry out value.

At the root level within the coalescing tree the node P_root receives a single atomic-add-with-carry instruction which it performs atomically upon the stored data value 52 and generates a return value. The addend operand value for the atomic-add-with-carry instruction received by the root node P_root is a sum of all the addend operands for the nodes P₁₀ to P₁₇ in the hierarchy. Carry out values in respect of carries generated during formation of this local sum value have already been returned.

As illustrated in FIG. 6, the nodes P₂₀ to P₂₃ each perform their local sum and coalescing operation and generate respective return carry out values for one of their two instruction sources. These return carry out values are numbered 1, 2, 3, 4 in FIG. 6. The remaining return carry out values for these nodes are not generated at this time, and the relevant threads in the higher level of the tree are held awaiting return of those return values.

At the next lower level in the hierarchy, the nodes P₃₀ and P₃₁ each receive two atomic-add-with-carry instructions from the level above and again perform a local sum operation generating one return carry out value, illustrated as return carry out values 5, 6, with the other carry out values being held.

The return value 5 is sent to node P₂₁ and can then serve to generate the return value 7 which is sent from node P₂₁ to node P₁₂. The return value 6 received at node P₂₃ is used to serve as the return value 8 which is sent from node P₂₃ to node P₁₆.

The final coalescing level within the coalescing tree of FIG. 6 corresponds to node P₄₀. This receives atomic-add-with-carry instructions from nodes P₃₀ and P₃₁. Node P₄₀ performs a local add of the addend operands of each of the received atomic-add-with-carry instructions and generates a return carry out value 9 which is passed back to node P₃₁. The return carry out value 9 can then propagate via node P₃₁ and node P₂₂ to serve as return carry out value 11 sent to node P₁₄.

The final coalescing node P₄₀ generates an output atomic-add-with-carry instruction which is sent to the root node P_root where it is added to the data value 52 and generates a return carry out value 12 which is returned to node P₄₀. The return value 12 propagates via nodes P₃₀ and P₂₀ to reach node P₁₀ as carry 15.

In overall operation it will be seen that each of the original source nodes P₁₀ to P₁₇ eventually receives a return carry out value. Early return carry out values are received by a significant proportion of these instruction sources (nodes) allowing them to continue with other processing before the final addition is performed at the root node P_root.

FIG. 7 is a flow diagram schematically illustrating the operation of one of the coalescing nodes of FIG. 6 (i.e. with a 2:1 fan-in). At step 54 processing waits until a first atomic-add-with-carry instruction is received. Processing then waits at step 56 until a second atomic-add-with-carry instruction is received. Step 58 performs a local add and generates a local carry. Step 60 returns the local carry to the second instruction source. Step 62 then sends an output atomic-add-with-carry instruction specifying the local sum calculated at step 58 to the next processor lower within the coalescing tree hierarchy. Step 64 waits until a return carry out value is received at the node. When a return carry out value is received, then step 66 forwards this return carry out value to the first instruction source.
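The node behaviour of FIG. 7 and its use in a tree like that of FIG. 6 can be sketched recursively. The following is an illustrative model only: the nested-tuple encoding of the tree, the names `resolve` and `run_tree`, the 8-bit width and the example addends are all assumptions. Which source receives the early carry follows the step numbering of FIG. 7 (local carry to the second source, downstream carry to the first).

```python
WIDTH = 8                     # assumed narrow width to make carries visible
MASK = (1 << WIDTH) - 1


def resolve(tree, returns, path=()):
    """Return (addend, deliver) for the AADDC emitted by this subtree.

    `tree` is an int addend (a source processor) or a pair of subtrees
    (a 2:1 coalescing node). `deliver(carry)` routes a returned carry
    back to the source that is still waiting for it.
    """
    if isinstance(tree, int):
        def deliver(carry):
            returns[path] = carry
        return tree & MASK, deliver
    first, second = (resolve(t, returns, path + (i,))
                     for i, t in enumerate(tree))
    total = first[0] + second[0]          # step 58: local add
    second[1](total >> WIDTH)             # step 60: local carry returned
    # Step 62: output AADDC carries the local sum; steps 64 and 66:
    # the carry eventually received is forwarded to the first source.
    return total & MASK, first[1]


def run_tree(tree, data_value):
    """Perform the root add and collect the carry seen by every source."""
    returns = {}
    addend, deliver = resolve(tree, returns)
    total = data_value + addend           # atomic add at the root
    deliver(total >> WIDTH)
    return total & MASK, returns


tree = ((0xF0, 0x20), (0x01, 0xFF))       # four sources, two levels
value, carries = run_tree(tree, 0xFC)
```

Every source ends up with exactly one returned carry, and the stored value plus the returned carries (weighted by 2^WIDTH) equals the true sum of the data value and all addends, so no carry is lost or double counted.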

Although illustrative embodiments have been described in detail herein with reference to the accompanying drawings, it is to be understood that the claims are not limited to those precise embodiments, and that various changes, additions and modifications can be effected therein by one skilled in the art without departing from the scope and spirit of the appended claims. For example, various combinations of the features of the dependent claims could be made with the features of the independent claims.

1. Apparatus for processing data comprising: processing circuitry to perform processing operations specified by program instructions; and an instruction decoder to decode an atomic-add-with-carry instruction to control said processing circuitry to perform as an atomic operation an add of an addend operand value and a data value stored in a storage unit in a manner consistent with exclusive access to said data value during said atomic operation to generate a result value stored in said storage unit and a carry value indicative of whether said add generated a carry out.
2. Apparatus as claimed in claim 1, wherein said storage unit is a memory mapped storage unit.
3. Apparatus as claimed in claim 1, wherein said storage unit is a memory.
4. Apparatus as claimed in claim 1, wherein said addend operand value and said data value have a shared range of bit significance and at least one of said addend operand value and said data value is part of a larger value having a total range of bit significance greater than and including said shared range of bit significance.
5. Apparatus as claimed in claim 4, wherein said instruction decoder decodes a further atomic-add-with-carry instruction to perform a further add with a further shared range of bit significance, said shared range of bit significance and said further shared range of bit significance being discrete ranges within said total range of bit significance.
6. Apparatus as claimed in claim 5, wherein said further shared range of bit significance is contiguous with said shared range of bit significance and corresponds to higher order bits within said total range of bit significance than said shared range of bit significance.
7. Apparatus as claimed in claim 6, wherein said carry value generated by said atomic-add-with-carry instruction is added to an addend operand of said further atomic-add-with-carry instruction.
8. Apparatus as claimed in claim 1, wherein said processing circuitry is formed to return said carry value as a return value when said atomic-add-with-carry instruction is performed.
9. Apparatus as claimed in claim 1, wherein said atomic-add-with-carry instruction has a carry-in operand and said add is an add of said addend operand, said data value and said carry-in operand.
10. Apparatus as claimed in claim 1, comprising a cache memory to store said data value, wherein, at least when said data value is not present in said cache memory, said processing circuitry is responsive to a sequence of atomic-add-with-carry instructions to accumulate a local sum value of respective addend operand values of a plurality of atomic-add-with-carry instructions within said sequence and to add said local sum value to said data value when said data value is available in said cache memory.
11. Apparatus as claimed in claim 10, wherein said sequence of atomic-add-with-carry instructions are instructions from respective program threads executing upon said apparatus.
12. Apparatus as claimed in claim 11, wherein said processing circuitry is formed, upon executing a given atomic-add-with-carry instruction that accumulates a given addend operand value for a given program thread into said local sum value and generates a given carry value, to return said given carry value to said given program thread to permit said given program thread to advance to execute further program instructions within said given program thread following said given atomic-add-with-carry instruction.
13. Apparatus as claimed in claim 12, wherein said processing circuitry is formed to delay returning a final carry value for a final atomic-add-with-carry instruction within said sequence until said data value is available in said cache memory and said local sum value is added to said data value to generate said final carry value.
14. Apparatus as claimed in claim 1, wherein said processing circuitry is formed to respond to a given input atomic-add-with-carry instruction specifying a given addend operand value received from a given instruction source and one or more further input atomic-add-with-carry instructions specifying respective further addend operand values received from respective further instruction sources to: perform a local addition of said given addend operand value and said one or more further input addend operand values to generate a local sum value and one or more local carry out values; send said one or more local carry out values as respective return values to said one or more further instruction sources; generate an output atomic-add-with-carry instruction specifying said local sum value as an output addend operand value; send said output atomic-add-with-carry instruction to a further processing apparatus; receive a received carry out value; and send said received carry out value as a return value to said given instruction source.
15. Apparatus as claimed in claim 14, wherein said received carry out value is a return value for said output atomic-add-with-carry instruction.
16. Apparatus as claimed in claim 14, wherein said given input atomic-add-with-carry instructions and said one or more further input atomic-add-with-carry instructions are an input sequence of input atomic-add-with-carry instructions and said given input atomic-add-with-carry instruction is a final input atomic-add-with-carry instruction within said input sequence.
17. Apparatus as claimed in claim 14, wherein said apparatus is part of a coalescing tree to coalesce atomic-add-with-carry instructions and formed of a plurality of processing apparatus branching from a root processing apparatus storing said data value.
18. Apparatus for processing data comprising: processing means for performing processing operations specified by program instructions; and instruction decoding means for decoding an atomic-add-with-carry instruction to control said processing means to perform as an atomic operation an add of an addend operand value and a data value stored in a storage unit in a manner consistent with exclusive access to said data value during said atomic operation to generate a result value stored in said storage unit and a carry value indicative of whether said add generated a carry out.
19. A method of processing data comprising: performing processing operations specified by program instructions with processing circuitry; and decoding an atomic-add-with-carry instruction to control said processing circuitry to perform as an atomic operation an add of an addend operand value and a data value stored in a storage unit in a manner consistent with exclusive access to said data value during said atomic operation to generate a result value stored in said storage unit and a carry value indicative of whether said add generated a carry out.